(54) Rule induction on large noisy data sets

(57) Efficient techniques for inducing rules used in classifying data items on a noisy data set. The prior-art IREP technique, which produces a set of classification rules by inducing each rule and then pruning it and continuing thus until a stopping condition is reached, is improved with a new rule-value metric for stopping pruning and with a stopping condition which depends on the description length of the rule set. The rule set which results from the improved IREP technique is then optimized by pruning rules from the set to minimize the description length and further optimized by making a replacement rule and a modified rule for each rule and using the description length to determine whether to use the replacement rule, the modified rule, or the original rule in the rule set. Further improvement is achieved by inducing rules for data items not covered by the original set and then pruning these rules. Still further improvement is gained by repeating the steps of inducing rules for data items not covered, pruning the rules, optimizing the rules, and again pruning for a fixed number of times. The fully-developed technique has the O(n log² n) running time characteristic of IREP, but produces rule sets which do a substantially better job of classification than those produced by IREP.



FIG. 4 (flowchart 401 for IREP* 403; legible block labels: GROW RULE, PRUNE RULE, ADD TO RULESET; legible reference numbers: 402, 409, 411)




EP 0 752 648 A1

Description 

Background of the Invention 
Field of the Invention

The invention relates generally to machine learning techniques and more particularly to techniques for inducing classification rules which will efficiently classify large, noisy sets of data.

Description of the Prior Art

Machine Classification: FIG. 1 

One of the most common human activities is classification. Given a set of objects, we classify the objects into subsets according to attributes of the objects. For example, if the objects are bills, we may classify them according to the attribute of payment date, with overdue bills making up one subset, due bills another, and not yet due bills the third.

Classification has always been expensive, and has accordingly always been mechanized to the extent permitted by technology. When the digital computer was developed, it was immediately applied to the task of classification. FIG. 1 shows a prior-art classification system 101 which has been implemented using a digital processor 105 and a memory system 103 for storing digital data. Memory 103 contains unclassified data 107 and classifier 111.

Unclassified data 107 is a set of data items 108. Each data item 108 includes attribute values 118(0..n) for a number of attributes 117(0..n). In the bill example, the attributes 117(i) of data items 108 representing bills would include the bill's due date and its past due date, and the attribute values 118(i) for a given data item 108 would include the due date and the past due date for the bill represented by that data item. Classifier 111 includes a classifier program 115. Operation of system 101 is as follows: processor 105 executes classifier program 115, which reads each data item 108 from unclassified data 107 into processor 105, classifies the data item 108, and places it in classified data 109 according to its class 110. In the bill example, there would be three classes 110: not yet due, due, and overdue.

While it is possible to build a classifier program 115 in which the classification logic is built into the program, it is common practice to separate classification logic 113 from the program, so that all that is necessary to use the program to classify different kinds of items is to change classification logic 113. One common kind of classification logic 113 is a set of rules 119. Each rule consists of a sequence of logical expressions 121 and a class specifier 123. Each logical expression 121 has an attribute 125 of the data items being classified, a logical operator such as =, <, >, or ≤, and a value 131 with which the value of attribute 125 is to be compared. Continuing with the bill example, classifier logic 113 for the bills would be made up of three rules:

    past_due_date < curr_date --> overdue
    due_date =< curr_date AND
    past_due_date >= curr_date --> due
    due_date > curr_date --> not yet due

The expression to the right of the --> symbol is the class to which the rule assigns a data item; the expression to the left is the sequence of logical expressions. To classify a data item, classifier program 115 applies rules 119 to the data item until one is found for which all of the logical expressions are true. The class specified for that rule is the class of the data item. Thus, if the due date for a bill is June 1, the past due date June 15, and the current date June 8, executing program 115 with the above set of rules will result in the application of the second rule above to the data item for the bill, and that will in turn classify the bill as being "due".
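The first-match procedure just described can be sketched in a few lines of Python. This is only an illustration under assumptions: the dictionary field names mirror the attribute names in the bill rules, while the representation of a rule as a list of predicates plus a class, the helper names `make_rules` and `classify`, and the year 1995 are hypothetical choices not taken from the source.

```python
from datetime import date

def make_rules(curr_date):
    """Build the three bill rules as (conditions, class) pairs.

    Each condition is a predicate over a data item (hypothetical encoding).
    """
    return [
        ([lambda b: b["past_due_date"] < curr_date], "overdue"),
        ([lambda b: b["due_date"] <= curr_date,
          lambda b: b["past_due_date"] >= curr_date], "due"),
        ([lambda b: b["due_date"] > curr_date], "not yet due"),
    ]

def classify(item, rules):
    """Return the class of the first rule whose conditions are all true."""
    for conditions, cls in rules:
        if all(cond(item) for cond in conditions):
            return cls
    return None  # no rule applied to this item

# The bill from the text: due June 1, past due June 15 (year is arbitrary).
bill = {"due_date": date(1995, 6, 1), "past_due_date": date(1995, 6, 15)}
print(classify(bill, make_rules(date(1995, 6, 8))))
```

With the current date set to June 8 the second rule is the first whose expressions are all true, which matches the classification "due" given in the text.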


Inducing Classifier Logic 113: FIG. 2 

Building classifier logic 113 for something as simple as the bill classification system is easily done by hand; however, as classification systems grow in complexity, it becomes necessary to automate the construction of classifier logic 113. The art has consequently developed systems for inducing a set of rules from a set of data items which have been labeled with their classifications.

FIG. 2 shows such a system 201, again implemented in a processor and memory. System 201 includes classified data 201 and induction program 205. Classified data 201 is simply a set of data items 108 in which each data item 108 has been classified. As shown at 203, each classified data item 203 includes values for a number of attributes and a class specifier 123 for the class to which the data item belongs. Classifier logic 113 is produced by executing induction program 205 on classified data 201.

There are two techniques known in the art for inducing classifier logic 113. In the first technique, induction program 205 begins by building classifier logic 113 that at first contains much more logic than is optimum for correctly classifying the data items and then prunes classifier logic 113 to reduce its size. In the second technique, classifier logic 113 is built piece by piece, with construction stopping when classifier logic 113 has reached the right size.

The first technique, in which classifier logic 113 is first made much larger than necessary and then pruned, is exemplified by the C4.5 system, described in J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA, 1993. In this system, induction program 205 produces a decision tree from classified data 201 which correctly classifies the data and then prunes the decision tree. One version of C4.5, called C4.5RULES, converts the unpruned decision tree to a set of rules by traversing the decision tree from the root to each leaf in turn. The result of each traversal to a leaf is a rule. The set of rules is then pruned to produce a smaller set which will also correctly classify the data.

The drawback of this technique is that it does not work well with example sets that are large and noisy. In the machine learning context, a noisy data set is one which does not permit generation of a set of rules in which a classification produced by a given rule is exactly correct but rather only permits generation of a set of rules in which the classification produced by a given rule is probably correct. As the size and/or the noisiness of the example data set increase, the technique becomes expensive in terms of both computation time and memory space. With regard to time, the technique's time requirements asymptotically approach O(n⁴), where n is the number of classified data items 203 in classified data 201. With regard to space, the technique requires that the entire decision tree be constructed in memory and in the case of the rule version, that there be storage space for all of the rules produced from the decision tree.
Some improvement of the foregoing is possible with problems where there are only two classes of data items, but even the improved technique requires O(n³) time and O(n²) space.

The second technique is much less expensive in terms of time and space. This technique, called Incremental Reduced Error Pruning, or IREP, is explained in detail in Johannes Fürnkranz and Gerhard Widmer, "Incremental reduced error pruning", in: Machine Learning: Proceedings of the Eleventh Annual Conference, Morgan Kaufmann, New Brunswick, NJ, 1994. IREP builds up classifier logic 113 as a set of rules, one rule at a time. After a rule is found, all examples covered by the rule (both positive and negative) are deleted from classified data 201. This process is repeated until there are no positive examples, or until the last rule found by IREP has an unacceptably large error rate.

In order to build a rule, IREP uses the following strategy. First, the examples from classified data 201 which are not 
covered by any rule are randomly partitioned into two subsets, a growing set and a pruning set. 

Next, a rule is "grown" using a technique such as FOIL, described in detail in J.R. Quinlan and R.M. Cameron-Jones, "FOIL: a Midterm Report", in: Pavel B. Brazdil, ed., Machine Learning: ECML-1993 (Lecture Notes in Computer Science #667), Springer-Verlag, Vienna, Austria, 1993. FOIL begins with an empty conjunction of conditions, and considers adding to this any condition of the form A_n = v, A_c ≤ θ, or A_c ≥ θ, where A_n is a nominal attribute and v is a legal value for A_n, or A_c is a continuous variable and θ is some value for A_c that occurs in the training data. A condition is selected to be added when adding the condition maximizes FOIL's information gain criterion. Conditions are added until the rule covers no negative examples from the growing dataset.
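The text names FOIL's information gain criterion without stating it. As a hedged sketch, the Python below uses the form of the gain commonly given for propositional rule growing, p1 · (log2(p1/(p1+n1)) − log2(p0/(p0+n0))); that formula, the function names `foil_gain` and `best_condition`, and the convention of returning 0 when no positives remain covered are assumptions for illustration, not details taken from this document.

```python
import math

def foil_gain(p0, n0, p1, n1):
    """Gain from adding one condition to a rule (assumed standard form).

    p0, n0: positive/negative growing examples covered before the condition.
    p1, n1: positive/negative growing examples covered after the condition.
    Assumes p0 > 0.
    """
    if p1 == 0:
        return 0.0  # assumption: a condition covering no positives gains nothing
    before = math.log2(p0 / (p0 + n0))
    after = math.log2(p1 / (p1 + n1))
    return p1 * (after - before)

def best_condition(candidates):
    """candidates maps a condition name to (p0, n0, p1, n1); pick the max gain."""
    return max(candidates, key=lambda name: foil_gain(*candidates[name]))

print(best_condition({"A_c <= 3": (10, 10, 8, 2), "A_n = red": (10, 10, 3, 0)}))
```

The example shows why the gain is weighted by the number of positives still covered: the second condition yields a perfectly pure rule but covers only three positives, so the first condition is preferred.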

Once grown, the rule is immediately pruned. Pruning is implemented by deleting a single final condition of the rule and choosing the deletion that maximizes the function

$$v(\mathit{Rule}, \mathit{PrunePos}, \mathit{PruneNeg}) = \frac{p + (N - n)}{P + N} \qquad (1)$$

where P (respectively N) is the total number of examples in PrunePos (PruneNeg) and p (n) is the number of examples in PrunePos (PruneNeg) covered by Rule. This process is repeated until no deletion improves the value of v. Rules thus grown and pruned are added to the rule set until the accuracy of the last rule added is less than the accuracy of the empty rule.
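A minimal sketch of this pruning loop follows, reading function (1) as (p + (N − n)) / (P + N), the form given for IREP in the literature. The encoding of a rule as a list of predicates and the names `covers`, `irep_value`, and `prune_final_conditions` are hypothetical.

```python
def covers(rule, example):
    """A rule (list of predicates) covers an example when every predicate holds."""
    return all(cond(example) for cond in rule)

def irep_value(rule, prune_pos, prune_neg):
    """Function (1): fraction of pruning examples the rule gets right."""
    P, N = len(prune_pos), len(prune_neg)
    p = sum(covers(rule, x) for x in prune_pos)
    n = sum(covers(rule, x) for x in prune_neg)
    return (p + (N - n)) / (P + N)

def prune_final_conditions(rule, prune_pos, prune_neg):
    """Delete the final condition repeatedly while the deletion improves v."""
    best = irep_value(rule, prune_pos, prune_neg)
    while rule:
        candidate = rule[:-1]
        value = irep_value(candidate, prune_pos, prune_neg)
        if value <= best:
            break  # no improvement: stop pruning
        rule, best = candidate, value
    return rule

# Toy pruning data (illustrative only).
prune_pos = [{"x": 1, "y": 1}, {"x": 2, "y": 9}, {"x": 3, "y": 2}]
prune_neg = [{"x": -1, "y": 7}, {"x": -2, "y": 0}]
rule = [lambda e: e["x"] > 0, lambda e: e["y"] > 5]
pruned = prune_final_conditions(rule, prune_pos, prune_neg)
print(len(pruned), irep_value(pruned, prune_pos, prune_neg))
```

On this toy data the second condition is dropped, since the one-condition rule covers all three positives and no negatives.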

IREP does indeed overcome the time and space problems posed by the first technique. IREP has a running time of O(n log² n) and because it grows its rule set, also has far smaller space requirements than the first technique. Experiments with IREP and C4.5RULES suggest that it would take about 79 CPU years for C4.5RULES to produce a rule set from an example data set having 500,000 data items, while IREP can produce a rule set from that data set in 7 CPU minutes. IREP is thus fast enough to be used in many interactive applications, while C4.5RULES is not. There are however two problems with IREP. The first is that rule sets made using the first technique make substantially fewer classification errors than those made using IREP. The second is that IREP fails to converge on some data sets, that is, exposing IREP to more classified examples from these data sets does not reduce the error rate of the rules.

It is an object of the invention to provide a technique for inducing a set of rules which has time and space require- 
ments on the order of those for IREP, but which converges and produces sets of rules which classify as well as those 
produced by the first technique. 

Summary of the Invention 

The foregoing and other problems of the art are solved by making a rule set which is substantially smaller than the largest rule set that can be made by the method being used and then producing a final rule set by optimizing the original rule set with regard to the rule set as a whole. Making a small rule set gives the time and space advantages of the IREP approach, while optimization with regard to the rule set as a whole substantially improves the quality of the classification produced by the rule set.

A particularly advantageous way of optimizing with regard to the rule set as a whole is to optimize so as to reduce the description length of the rule set. The invention features two types of such optimization. In one type, rules are pruned from the rule set to reduce the description length. In another type, rules in the rule set are modified to reduce the description length. In a preferred embodiment, the rule set is first pruned and the pruned rule set is then modified.

Additional improvement is achieved by iterating with any example data items which are not covered by the rules in the optimized rule set. New rules are generated for those data items as described above and added to the rule set produced by the first iteration. The new rule set is then optimized. Iteration may continue a fixed number of times or until there are no data items which are not correctly classified by the rule set.

In other aspects of the invention, the set of rules is produced by inducing the rules one by one and pruning each rule as it is produced. Production of rules continues until a stopping condition is satisfied. The invention further provides better techniques for pruning individual rules and a better rule value metric for determining when to stop pruning a rule. Also provided is a stopping condition for the rule set which is based on the description length of the rule set with the new rule relative to the smallest description length obtained for any of the rule sets thus far. Finally, IREP has been improved to support missing attributes, numerical variables, and multiple classes.

Other objects and advantages of the apparatus and methods disclosed herein will be apparent to those of ordinary 
skill in the art upon perusal of the following Drawing and Detailed Description, wherein: 

Brief Description of the Drawing

FIG. 1 is a block diagram of a prior-art classifier;
FIG. 2 is a block diagram of a prior-art system for inducing classifier logic;
FIG. 3 is a diagram of modules in an induction program;
FIG. 4 is a flowchart of a first rule induction method;
FIG. 5 is a flowchart of a second rule induction method;
FIG. 6 is a flowchart of a third rule induction method;
FIG. 7 is pseudo-code for a preferred embodiment of a first portion of the method of FIG. 6;
FIG. 8 is pseudo-code for a preferred embodiment of a second portion of the method of FIG. 6;
FIG. 9 is pseudo-code for a preferred embodiment of a third portion of the method of FIG. 6;
FIG. 10 is pseudo-code for a preferred embodiment of a fourth portion of the method of FIG. 6; and
FIG. 11 is pseudo-code for a preferred embodiment of a fifth portion of the method of FIG. 6.

Reference numbers in the Drawing have two parts: the two least-significant digits are the number of an item in a figure; the remaining digits are the number of the figure in which the item first appears. Thus, an item with the reference number 201 first appears in FIG. 2.

Detailed Description of a Preferred Embodiment 

In the following, the new technique for inducing a set of rules is described in three stages: first, an improved version of IREP called IREP* is presented; then a technique for optimizing the rule set produced by IREP* is set forth; next, a method which combines IREP* and the optimization is described. This method is termed RIPPER (for Repeated Incremental Pruning to Produce Error Reduction). Finally, an iterative version of RIPPER called RIPPERk is presented. Thereupon, details are provided of the preferred embodiment's implementation of salient portions of IREP* and RIPPER.

IREP*: FIG. 4 

A flowchart 401 for IREP* 403 is shown in FIG. 4. The first part of IREP* 403 is loop 414, which builds the set of rules rule by rule. At step 407, a rule 119(i) is grown in the fashion described above for IREP. The next step, 409, is to prune rule 119(i). In contrast to IREP, any final sequence of conditions in rule 119(i) is considered for pruning and that sequence is retained which maximizes the rule-value metric function

$$v^{*}(\mathit{Rule}, \mathit{PrunePos}, \mathit{PruneNeg}) = \frac{p - n}{p + n}$$

The above function is for rules that classify the data items into two classes. Here p represents the number of positive data items, that is, those that the rule successfully classifies as members of the rule's class, and n represents the number of negative data items, that is, those that the rule successfully classifies as not being members of the rule's class. After rule 119(i) has been grown and pruned, it is added to rule set 120 (411).
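The following Python sketch illustrates pruning over every final sequence of conditions with the metric (p − n)/(p + n). It makes several assumptions that the text does not settle: p and n are taken to be the numbers of pruning positives and pruning negatives covered by the rule (as in function (1)), a rule covering nothing scores 0, the empty rule is not considered, and ties favor the longer rule. All names are hypothetical.

```python
def covers(rule, example):
    """A rule (list of predicates) covers an example when every predicate holds."""
    return all(cond(example) for cond in rule)

def irep_star_value(rule, prune_pos, prune_neg):
    """Rule-value metric (p - n) / (p + n) on the pruning data."""
    p = sum(covers(rule, x) for x in prune_pos)
    n = sum(covers(rule, x) for x in prune_neg)
    return (p - n) / (p + n) if p + n else 0.0  # assumption: 0 if nothing covered

def prune_rule(rule, prune_pos, prune_neg):
    """Consider deleting every final sequence of conditions; keep the best prefix."""
    prefixes = [rule[:k] for k in range(len(rule), 0, -1)]  # longest first
    return max(prefixes, key=lambda r: irep_star_value(r, prune_pos, prune_neg))

# Toy pruning data (illustrative only).
prune_pos = [{"x": 1, "y": 1, "z": 9}, {"x": 2, "y": 9, "z": 9},
             {"x": 3, "y": 2, "z": 1}, {"x": 4, "y": 9, "z": 1}]
prune_neg = [{"x": -1, "y": 7, "z": 7}, {"x": 1, "y": 9, "z": 9}]
rule = [lambda e: e["x"] > 0, lambda e: e["y"] > 5, lambda e: e["z"] > 5]
pruned = prune_rule(rule, prune_pos, prune_neg)
print(len(pruned), irep_star_value(pruned, prune_pos, prune_neg))
```

Here the full rule scores 0, dropping the last condition scores 1/3, and dropping the last two scores 3/5, so the one-condition rule is retained.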
Decision block 413 determines whether the stopping condition for rule set 120 has been met. If it has not, loop 414 is repeated. Proper choice of the stopping condition ensures that rule set 120 is large enough to properly classify the data but small enough to avoid the time and space problems of techniques such as those used in the C4.5 system. In the preferred embodiment the stopping condition is determined as follows using the Minimum Description Length Principle. As set forth at Quinlan, C4.5: Programs for Machine Learning, supra, p. 51 f., the principle states that the best set of rules derivable from the training data will minimize the number of bits required to encode a message consisting of the set of rules together with those data items which are not correctly classified by the rules and are therefore exceptions to them. The length of this message for a given set of rules is the description length of the rule set, and the best rule set is the one with the minimum description length.

In IREP* 403, the description length is used like this to determine whether the rule set is large enough: After each 
rule is added, the description length for the new rule set is computed. IREP* 403 stops adding rules when this descrip- 
tion length is more than d bits larger than the smallest description length obtained for any rule set so far, or when there 
are no more positive examples. In the preferred embodiment, d = 64. 
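As a rough illustration of this stopping test, the sketch below takes the description lengths that would result from adding each successive rule and reports how many rules are accepted before the test fires. The function name `rules_accepted` and the decision to treat the rule that triggers the test as rejected are assumptions; the check for remaining positive examples is omitted for brevity.

```python
def rules_accepted(description_lengths, d=64.0):
    """Count rules added before a description length exceeds the best by more than d.

    description_lengths[i] is the description length (in bits) of the rule set
    after rule i is added.
    """
    best = float("inf")
    for i, dl in enumerate(description_lengths):
        if dl > best + d:
            return i  # stop: this rule set is more than d bits above the best
        best = min(best, dl)
    return len(description_lengths)

print(rules_accepted([500.0, 450.0, 470.0, 520.0, 400.0]))
```

In the example the smallest length seen so far is 450 bits when the fourth candidate produces 520 bits, which is more than 64 bits larger, so three rules are accepted. Allowing a slack of d bits rather than stopping at the first increase lets the search continue past small local rises in description length.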

In the preferred embodiment, the scheme used to encode the description length of a rule set and its exceptions is described in J. Ross Quinlan, "MDL and categorical theories (continued)", in: Machine Learning: Proceedings of the Twelfth International Conference, Lake Tahoe, CA, 1995, Morgan Kaufmann. One part of this encoding scheme can be used to determine the number of bits needed to send a rule with k conditions. The part of interest allows one to identify a subset of k elements of a known set of n elements using

$$S(n, k, p) = k \log_2 \frac{1}{p} + (n - k) \log_2 \frac{1}{1 - p}$$

bits, where p is known by the recipient of the message. Thus we allow ||k|| + S(n,k,k/n) bits to send a rule with k conditions, where n is the number of possible conditions that could appear in a rule and ||k|| is the number of bits needed to send the integer k. The estimated number of bits required to send the theory is then multiplied by 0.5 to adjust for possible redundancy in the attributes.
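The rule-encoding cost can be sketched as follows, reading S(n, k, p) as k·log2(1/p) + (n − k)·log2(1/(1 − p)). The text does not say how ||k|| is computed, so `integer_bits` uses log2(k + 1) purely as a stand-in; that choice, the treatment of zero-count terms as costing 0 bits, and all function names are assumptions.

```python
import math

def _log2_inv(q):
    """log2(1/q), with an infinite cost for an impossible event."""
    return math.inf if q <= 0 else math.log2(1.0 / q)

def subset_bits(n, k, p):
    """S(n, k, p): bits to identify k of n elements when p is known."""
    bits = 0.0
    if k > 0:
        bits += k * _log2_inv(p)
    if n - k > 0:
        bits += (n - k) * _log2_inv(1.0 - p)
    return bits

def integer_bits(k):
    # Assumption: stand-in for ||k||; the source does not define this encoding.
    return math.log2(k + 1)

def rule_bits(k, n_conditions):
    """0.5 * (||k|| + S(n, k, k/n)) for one rule with k of n possible conditions."""
    return 0.5 * (integer_bits(k) + subset_bits(n_conditions, k, k / n_conditions))

print(round(subset_bits(8, 2, 0.25), 4), round(rule_bits(2, 8), 4))
```

Applying the 0.5 factor per rule gives the same total as applying it once to the whole theory, since the theory cost is a sum over rules.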

The number of bits needed to send exceptions is determined as follows, where T is the total number of examples, C is the number of examples covered, U is the number of examples not covered, e is the number of errors, fp is the number of false positive errors, and fn is the number of false negative errors. The number of bits to send exceptions is then

    if (C > T/2) then
        log(T + 1) + S(C, fp, e/2C) + S(U, fn, fn/U)
    else
        log(T + 1) + S(U, fn, e/2U) + S(C, fp, fp/C)
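The exception cost above can be sketched in Python as below. Several points are assumptions rather than statements from the text: T is taken to be C + U, e is taken to be fp + fn, the logarithm is taken base 2, S is read as k·log2(1/p) + (n − k)·log2(1/(1 − p)), and ratios with a zero denominator are treated as 0. The function names are hypothetical.

```python
import math

def subset_bits(n, k, p):
    """S(n, k, p) = k*log2(1/p) + (n-k)*log2(1/(1-p)); zero-count terms cost 0."""
    def log2_inv(q):
        return math.inf if q <= 0 else math.log2(1.0 / q)
    bits = 0.0
    if k > 0:
        bits += k * log2_inv(p)
    if n - k > 0:
        bits += (n - k) * log2_inv(1.0 - p)
    return bits

def exception_bits(covered, uncovered, fp, fn):
    """Bits to send the exceptions, following the two-branch formula."""
    total = covered + uncovered   # T (assumed to be C + U)
    errors = fp + fn              # e (assumed to be fp + fn)
    if total == 0:
        return 0.0
    header = math.log2(total + 1)
    if covered > total / 2:
        return (header
                + subset_bits(covered, fp, errors / (2 * covered))
                + subset_bits(uncovered, fn, fn / uncovered if uncovered else 0.0))
    return (header
            + subset_bits(uncovered, fn, errors / (2 * uncovered))
            + subset_bits(covered, fp, fp / covered if covered else 0.0))

print(round(exception_bits(60, 40, 0, 0), 4), round(exception_bits(60, 40, 5, 3), 4))
```

With no errors only the log(T + 1) header is paid; each additional exception raises the cost, which is what lets the description length trade rule complexity against accuracy.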



After the stopping condition has been met, the rule set is pruned in step 415. The pruning is done in a preferred embodiment by examining each rule in turn (starting with the last rule added), computing the description length of the rule set with and without the rule, and deleting any rule whose absence reduces the description length.
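A minimal sketch of this rule-set pruning pass is shown below. The description length is supplied as a callback so the sketch stays independent of any particular encoding; the function name `prune_rule_set` and the toy cost model in the example are hypothetical.

```python
def prune_rule_set(rules, description_length):
    """Examine rules from last added to first; drop any whose absence lowers the DL."""
    current = list(rules)
    for rule in reversed(list(rules)):
        without = [r for r in current if r is not rule]
        if description_length(without) < description_length(current):
            current = without
    return current

# Toy model: each rule is (name, bits to send the rule, exception bits it saves).
rules = [("a", 10, 30), ("b", 12, 5), ("c", 8, 20)]

def toy_dl(rule_set):
    return 100 + sum(cost - saved for _, cost, saved in rule_set)

print([name for name, _, _ in prune_rule_set(rules, toy_dl)])
```

In the toy model rule "b" costs more bits than it saves in exceptions, so removing it shortens the description and it is deleted, while "a" and "c" are kept.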

Together, the rule-value metric used in pruning step 409 and the stopping metric used in stopping condition 413 of IREP* 403 substantially improve IREP's performance. IREP* 403 converges on data sets upon which IREP fails to converge and the rule sets produced using IREP* 403 do substantially better at making correct classifications than those produced using IREP. In tests on a suite of data sets used for determining the performance of systems for inducing rules, sets of rules produced by IREP* 403 had 6% more classification errors than sets of rules produced by C4.5RULES, while sets of rules produced by IREP had 13% more errors.

IREP* improves on other aspects of IREP as well. As originally implemented, IREP did not support missing 
attribute values in a data item, attributes with numerical values, or multiple classes. Missing attribute values are handled 
like this: all tests involving the attribute A are defined to fail on instances for which the value of A is missing. This 
encourages IREP* to separate out the positive examples using tests that are known to succeed. 

IREP* or any method which induces rules that can distinguish two classes can be extended to handle multiple classes in this fashion: First, the classes are ordered. In the preferred embodiment the ordering is always in increasing order of prevalence, i.e., the ordering is C_1, ..., C_k, where C_1 is the least prevalent class and C_k is the most prevalent. Then, the two-class rule induction method is used to find a rule set that separates C_1 from the remaining classes; this is done by splitting the example data into a class of positive data which includes only examples labeled C_1 and a class of negative data which contains examples of all the other classes and then calling the two-class rule induction method to induce rules for C_1. When this is done, all data items covered by the learned rule set, that is, those classified as belonging to C_1 by those rules, are removed from the data set. The above process is repeated with each of the remaining classes C_2, ..., C_k until only C_k remains; this class will be used as the default class.
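The multi-class extension can be sketched as a wrapper around any two-class learner. In the Python below, `learn_two_class` and `covers` are caller-supplied callbacks, and the stub learner in the example simply memorizes its positive items; these names and the stub are hypothetical and stand in for the real rule induction method.

```python
from collections import Counter

def learn_multiclass(examples, learn_two_class, covers):
    """examples: list of (item, label). Returns (ordered (rule, class) list, default)."""
    counts = Counter(label for _, label in examples)
    ordered = sorted(counts, key=lambda c: counts[c])  # least prevalent first
    remaining = list(examples)
    decision_list = []
    for cls in ordered[:-1]:  # the most prevalent class becomes the default
        pos = [x for x, y in remaining if y == cls]
        neg = [x for x, y in remaining if y != cls]
        rules = learn_two_class(pos, neg)
        decision_list.extend((rule, cls) for rule in rules)
        # Remove every example covered by the rules just learned.
        remaining = [(x, y) for x, y in remaining
                     if not any(covers(rule, x) for rule in rules)]
    return decision_list, ordered[-1]

# Stub two-class learner for illustration: one rule that memorizes the positives.
def stub_learner(pos, neg):
    return [frozenset(pos)]

data = [(1, "a"), (2, "b"), (3, "b"), (4, "c"), (5, "c"), (6, "c")]
rules, default = learn_multiclass(data, stub_learner, lambda rule, x: x in rule)
print([cls for _, cls in rules], default)
```

Learning the rarest classes first means the rules for each class only need to separate it from classes that are at least as common, and the most common class needs no rules at all.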

Optimization of the Rule Set: FIG. 4

A problem with IREP is that the effect of a given rule on the quality of the set of rules as a whole is never considered. IREP* 403 begins to deal with this problem with step 415 of pruning the rule set, as described above. A further approach to dealing with this problem is optimization step 417. The aim of the optimization is to modify the rules in the rule set so as to minimize the error of the entire rule set.

In the preferred embodiment, the method used in optimization step 417 is the following: Given a rule set 120 R_1, ..., R_k, consider each rule in turn: first R_1, then R_2, etc., in the order in which they were induced. For each rule R_i, two alternative rules are constructed. The replacement for R_i is formed by growing and then pruning a rule R_i', where pruning is guided so as to minimize error of the entire rule set R_1, ..., R_i', ..., R_k on the pruning data. The revision of R_i is formed analogously, except that the revision is grown by greedily adding conditions to R_i, rather than the empty rule. Finally, the description length technique described above is used to determine whether the final rule set 120 should include the revised rule, the replacement rule, or the original rule. This is done by inserting each of the variants of R_i into the rule set and then deleting rules that increase the description length of the rules and examples. The description length of the examples and the simplified rule set is then used to compare variants of R_i and the variant is chosen which produces the rule set with the shortest description length.
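The final selection among the three variants can be sketched as below. This is a simplification under stated assumptions: the intermediate step of deleting rules that increase the description length is omitted, the description length is a caller-supplied callback, ties favor the original rule, and the name `choose_variant` and the toy length function are hypothetical.

```python
def choose_variant(rule_set, index, replacement, revision, description_length):
    """Pick the original, replacement, or revision giving the shortest DL."""
    candidates = [rule_set[index], replacement, revision]  # original listed first

    def dl_with(variant):
        trial = rule_set[:index] + [variant] + rule_set[index + 1:]
        return description_length(trial)

    return min(candidates, key=dl_with)

# Toy stand-in: rules are strings and the description length is total characters.
rule_set = ["aaaa", "bbbbbb", "cc"]
toy_dl = lambda rules: sum(len(r) for r in rules)
print(choose_variant(rule_set, 1, "bbb", "bbbbbbb", toy_dl))
```

Because each variant is scored inside the whole rule set, a rule is judged by its effect on the set as a whole rather than by its accuracy in isolation.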

RIPPER: FIG. 5

IREP* 403 and optimization step 417 are employed in RIPPER method 501 shown in FIG. 5. A first rule set is obtained by using IREP* 403 and optimizing the result (417), as shown in flowchart 401; then as indicated in decision block 503, the rule set is applied to the example data items to see if there are any data items which are not covered by the rule set, i.e., which are not correctly classified by the rule set. If there are, as shown in block 509, loop 511 uses IREP* 403 to add rules to the rule set until all examples are covered. RIPPER 501 further improves over IREP: rule sets produced by RIPPER 501 now make only 1% more classification errors than those produced by C4.5RULES.

RIPPERk: FIG. 6

Further performance improvements can be obtained by placing loop 511 from RIPPER 501 in another loop which iterates finding data items not covered by the rule set, adding rules for those data items to the set of rules to produce an augmented rule set, and then optimizing the augmented rule set using the techniques described above for IREP*. This version of the technique, called RIPPERk, where k is the number of iterations, is shown in FIG. 6.

RIPPERk 601 begins with the steps of flowchart 401 (i.e., IREP* 403 plus optimization 417); it then enters loop 615, which it executes a fixed number of times. On each iteration of loop 615, RIPPER loop 511 is executed to obtain a rule set which covers all of the examples. This rule set is then optimized in step 613 in the fashion described above with regard to optimization step 417 and thereupon pruned as described with regard to pruning step 415. This final version of the technique was run on the trial data sets with k = 2. The rule set produced by RIPPER2 was as good at classifying as that produced by C4.5RULES and RIPPER2 retained the O(n log² n) running time characteristic of IREP.

Details of a Preferred Embodiment: FIGS. 3, 7-11

The foregoing techniques are implemented in a preferred embodiment by means of an improved induction program 301, shown in FIG. 3. Induction program 301 includes two sets of components. One set of components 303 makes the rule set; the other set, 305, optimizes the rule set. Rule set making components 303 include a rule growing component 307, which grows individual rules, a rule pruning component 309 which prunes the rules and includes the rule value metric, a stopping condition component 311 which determines whether further rules should be added to the rule set, and a rule set pruning component, which prunes the rule set. Rule set optimizer 305 includes a component 315 for making replacement rules, a component 317 for making revised rules, and a deciding component 319 for deciding whether to use the original rule, the replacement rule, or the revised rule in the rule set.

Pseudo-Code for the Preferred Embodiment 

FIGS. 7-11 present pseudo-code for an implementation of the above components together with the control logic required for RIPPERk, in the preferred embodiment. The implementation is a two-class classification system; as described above, such a two-class classification system can be used to implement a multi-class classification system.

ripper 701 


Beginning with FIG. 7, ripper 701 is the top level function which implements RIPPERk in the preferred embodiment. It takes a set of classified examples 201 as an argument and returns a set of rules hyp. The part of ripper labeled 403 implements IREP*, while the part of it labeled 601 implements RIPPERk. At 703, ripper invokes the function add_rules, which implements loop 414 of flowchart 401 and produces a first set of rules for the dataset and classification. Then the function reduce_dlen prunes the rule set, and thus implements step 415 of flowchart 401.

The pruned rule set is then iteratively optimized k times in loop 704, which thus implements loop 615 of FIG. 6. In loop 704, the function optimize_rules 707 implements process steps 611 and 613, with the function add_rules adding rules for data items not covered by the current rule set, and the function reduce_dlen performing pruning step 614. When loop 704 has run the prescribed number of times, ripper returns the final rule set.
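The control flow of ripper as described here can be sketched as a short driver. The three callbacks stand in for the functions add_rules, optimize_rules, and reduce_dlen named in the text; their signatures, the Python name `ripper_k`, and the order of calls inside the loop (optimize, then add, then prune, following the order in which the text mentions them) are assumptions, since FIG. 7 itself is not reproduced here.

```python
def ripper_k(examples, k, add_rules, optimize_rules, reduce_dlen):
    """Top-level driver: IREP* followed by k rounds of optimization."""
    rules = add_rules([], examples)       # build the first rule set (loop 414)
    rules = reduce_dlen(rules, examples)  # prune the rule set (step 415)
    for _ in range(k):                    # loop 704
        rules = optimize_rules(rules, examples)  # replace or revise each rule
        rules = add_rules(rules, examples)       # cover examples still uncovered
        rules = reduce_dlen(rules, examples)     # prune the rule set again
    return rules

# Stub callbacks that only record the order of calls (illustration only).
calls = []

def stub(name):
    def fn(rules, examples):
        calls.append(name)
        return rules
    return fn

ripper_k([], 2, stub("add_rules"), stub("optimize_rules"), stub("reduce_dlen"))
print(calls)
```

Passing the components as callbacks keeps the driver independent of how rules, examples, and description lengths are represented.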


add_rules

Continuing with the functions invoked by ripper, add_rules is shown at 801 in FIG. 8. The first step, 803, is removing any examples covered by a rule that is already in the rule set from the example data. Then new rules are added in loop 804 until the stopping condition occurs. To build each rule, the example data is first partitioned into a set of data for growing the rule and a set of data for testing it for pruning purposes (805). Then the new rule is built (806). Construction starts with an "empty rule" that has the class "+" (since this is a two-class classifier) and an empty set of logical expressions 121. In the case of a multiple class system, the empty rule would have the class for which rules were currently being made.

At 807, the refine function adds the logical expressions 121 to the rule. This function is shown at 903 in FIG. 9. Loop 904 adds logical expressions one at a time until there are no negative examples covered by the rule. As each logical expression is added, its information gain is computed as shown at 907 in ref_value function 905. When the stopping condition for adding logical expressions is reached, the rule is returned; otherwise, the logical expression is added to the rule and negative examples no longer covered by the refined rule are removed from the data set and the loop is repeated.

Next, at 809, the simplify function prunes the new rule. simplify is shown in more detail at 909. Loop 910 of the function performs different prunings; for each pruning, the function gen_value computes the rule-value metric. If the rule-value metric for the current pruning is better than the best previously achieved, the pruning is retained; otherwise, it is deleted. When a pruning is retained, the negative examples not covered by the pruning are removed from the data set and the loop is repeated. gen_value is shown in detail in FIG. 10 at 1001. The part of gen_value which is of importance for the present discussion is at 1005, where the rule-value metric discussed supra is shown at 1007.

At 811, the function reject_rule is invoked to check the stopping condition. Pseudo-code for the function is at 901. As shown there, the preferred embodiment has two stopping conditions. The first one to be checked (911) uses the description length and indicates that the stopping condition has occurred when the description length which results when the current rule is added to the rule set is larger than the shortest description length yet attained for the rule set by an amount which is greater than or equal to the constant amount MAX_DECOMPRESSION. If this stopping condition has not occurred, the function checks at 913 whether the rule to be added has an error rate of more than 50%; again, if it does, the function indicates that the stopping condition has occurred. When the stopping condition has occurred, the variable last_rule_accepted is set to FALSE, which terminates loop 804. If the stopping condition has not occurred, the examples covered by the new rule are removed from the data (813) and the new rule is added to the rule set (815).

reduce_dlen 

The reduce_dlen function (705) prunes the rule set produced by add_rules. The function is shown in detail at 1109 in
FIG. 11. The function consists mostly of loop 1111, which, for each rule in turn, makes a copy of the current rule set
without the rule and then computes the description lengths of the current rule set with and without the rule. If the current
rule set without the rule has the shorter description length (1113), that rule set becomes the current rule set. The
description length is computed by the function total_dlen, shown at 1115. total_dlen first uses the function data_dlen
to compute the description length of the data items which are exceptions to the current rule set (1117) and then computes
the description length for the entire rule set. As shown at 1119, that is done by starting with the description length of the
data items and then adding to it the description length of each rule in turn. As for data_dlen, that function is shown in
detail at 1101. The function simply implements the method described in the Quinlan 1995 reference discussed supra.
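
The three functions may be sketched in Python as follows, as an illustration under stated assumptions. data_dlen here takes the four counts directly instead of applying the rules to the data. In the branch where fewer examples are covered than uncovered, the sketch encodes the false positives at the rate fp/cov; that symmetric choice is an assumption and differs from FIG. 11 as printed. The callables exceptions_dlen and rule_dlen are hypothetical stand-ins for data_dlen applied to a rule set and for the per-rule description length.

```python
import math
from typing import Callable, List, Sequence


def subset_dlen(n: int, e: int, p: float) -> float:
    """Bits to identify e exceptions among n items at exception rate p."""
    bits = 0.0
    if e > 0:
        bits += -math.log2(p) * e
    if n - e > 0:
        bits += -math.log2(1.0 - p) * (n - e)
    return bits


def data_dlen(fp: int, fn: int, cov: int, uncov: int) -> float:
    """Description length of the exceptions to a rule set."""
    def rate(num: float, den: int) -> float:
        return num / den if den else 0.0

    e = fp + fn
    header = math.log2(cov + uncov + 1)
    if cov >= uncov:
        return (header + subset_dlen(cov, fp, rate(0.5 * e, cov))
                + subset_dlen(uncov, fn, rate(fn, uncov)))
    # Assumption: symmetric rate fp/cov in this branch.
    return (header + subset_dlen(uncov, fn, rate(0.5 * e, uncov))
            + subset_dlen(cov, fp, rate(fp, cov)))


def total_dlen(rules: Sequence, exceptions_dlen: Callable,
               rule_dlen: Callable) -> float:
    """Exceptions first, then each rule's description length in turn."""
    return exceptions_dlen(rules) + sum(rule_dlen(r) for r in rules)


def reduce_dlen(rules: Sequence, exceptions_dlen: Callable,
                rule_dlen: Callable) -> List:
    """Delete each rule, last to first, whose removal shortens the total."""
    current = list(rules)
    for i in range(len(current) - 1, -1, -1):
        without = current[:i] + current[i + 1:]
        if (total_dlen(without, exceptions_dlen, rule_dlen)
                < total_dlen(current, exceptions_dlen, rule_dlen)):
            current = without
    return current
```

Visiting the rules from last to first means that deleting rule i leaves the positions of the rules still to be visited unchanged.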

optimize_rules

This function takes the rule set produced by IREP* 403 and optimizes it by making a new rule for each rule in the rule
set, making a modified rule for each rule in the rule set, and then using the description lengths of the rule set with the
original rule, with the new rule, and with the modified rule to select one of the three for inclusion in the optimized rule
set. The function contains loop 712, which is executed for each rule in the rule set. For each rule, the function saves the
old rule (710). It then makes a new rule (713) in the same manner as explained for add_rules; next it makes a modified
rule (715) by adding logical expressions to the old rule. Adding and pruning are again done as explained for add_rules.
Next, the rule that yields the rule set with the shortest description length is chosen (717). Then the examples covered
by the chosen rule are removed from the example data (721).
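
The selection made at 717 may be sketched in Python as follows. The sketch is illustrative only; the callable compression is a hypothetical stand-in for relative_compression applied with the rule set, rule position, and data held fixed.

```python
from typing import Callable, TypeVar

R = TypeVar("R")


def choose_rule(old_rule: R, new_rule: R, revised_rule: R,
                compression: Callable[[R], float]) -> R:
    """Pick the variant with the greatest compression.

    Ties favour the old rule first and the revised rule second,
    following the order of the comparisons in FIG. 7.
    """
    old_val = compression(old_rule)
    new_val = compression(new_rule)
    rev_val = compression(revised_rule)
    if old_val >= new_val and old_val >= rev_val:
        return old_rule
    if rev_val >= new_val:
        return revised_rule
    return new_rule
```

The order of the comparisons means that an existing rule is replaced only when a variant is strictly better.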

The function used to compute the description length is relative_compression, shown in detail in FIG. 10 at 1009.
The function first produces a copy of the rule set with the rule and prunes it using reduce_dlen (1011); then it does the
same with a copy of the rule set without the rule (1013); then it computes the description length of the exceptions for
each of the pruned rule sets (1015), and finally it returns the difference between the description length for the
exceptions for the rule set without the rule and the sum of the description length for the exceptions for the rule set with the
rule plus the description length of the rule (1017). The computation of the description lengths is done using data_dlen
as already described above.
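
The computation may be sketched in Python as follows, for illustration only. The callables reduce_dlen, data_dlen, and rule_dlen are hypothetical one-argument stand-ins for the functions of the same names with the data held fixed. Dropping the rule from the copy, rather than substituting a rule whose body is 'false' as FIG. 10 does, is an assumption made for brevity; such a rule covers no examples.

```python
from typing import Callable, List, Sequence


def relative_compression(rule, rules: Sequence, index: int,
                         reduce_dlen: Callable[[List], List],
                         data_dlen: Callable[[List], float],
                         rule_dlen: Callable[[object], float]) -> float:
    """Bits saved by having `rule` at position `index` versus having no rule there."""
    # Rule set with the candidate rule in place, then pruned (1011).
    with_rule = list(rules)
    with_rule[index] = rule
    with_rule = reduce_dlen(with_rule)

    # Rule set without any rule at that position, then pruned (1013).
    without_rule = list(rules[:index]) + list(rules[index + 1:])
    without_rule = reduce_dlen(without_rule)

    # Exceptions of each pruned set (1015) and the difference (1017).
    dlen_with = data_dlen(with_rule)
    dlen_without = data_dlen(without_rule)
    return dlen_without - (dlen_with + rule_dlen(rule))
```

A positive result indicates that the saving in exception bits outweighs the cost of encoding the rule itself.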

Conclusion 

The foregoing Detailed Description has disclosed to those skilled in the art the best mode presently known to the inven- 
tor of practicing his techniques for inducing rule sets for classifiers from example data sets. The techniques disclosed 
herein produce rule sets which are as accurate as those produced by systems such as C4.5, but the production of the 
rule sets requires far fewer computational resources. Resources are saved by producing a rule set which has "just 
enough" rules; accuracy is obtained by the stopping conditions used to terminate rule pruning and rule set growth and 
by optimization techniques which optimize the rule set with regard to the rule set as a whole. Iteration increases the 
effectiveness of the optimization techniques. A particular advantage of the techniques disclosed herein is their use of 
description length to determine the stopping condition and to optimize the rule set.

As will be immediately apparent to those skilled in the art, many embodiments of the techniques other than those
disclosed herein are possible. For example, the preferred embodiment uses an improvement of IREP to produce the
rule set; however, any other technique may be used which similarly produces "just enough" rules. Further, the preferred 
embodiment uses description length to optimize with regard to the entire rule set; however, other optimization tech- 
niques which optimize with regard to the entire rule set may be used as well. Moreover, optimization techniques other 
than the pruning and modification techniques disclosed herein may be employed. Finally, those skilled in the art are 
easily capable of producing implementations of the principles of the invention other than the implementation disclosed 
in the pseudo-code. 

All of the above being the case, the foregoing Detailed Description is to be understood as being in every respect 
illustrative and exemplary, but not restrictive, and the scope of the invention disclosed herein is not to be determined 
from the Detailed Description, but rather from the claims as interpreted according to the full breadth permitted by the
law.

Claims 

1. A method practiced in a computer system which includes a processor and a memory system of inducing a second
set of rules for classifying data items from an example dataset of the data items, the rules and the example dataset 
being stored in the memory system 

and the method comprising the steps performed in the processor of: 

inducing a first set of rules from the example data set according to a predetermined method, the first set being 
substantially smaller than the largest set producible by the predetermined method, and storing the first set in 
the memory means; and 

optimizing the first set with regard to the entire first set to produce the second set. 

2. The method set forth in claim 1 further comprising the steps of: 

after producing the second set, producing a third set of rules by adding rules to the second set to cover data 
items from the example data set not covered by the second set; and 
optimizing the third set to produce a new second set. 






3. The method set forth in claim 2 wherein: 

the method is iterated n times and the second set is the new second set produced in the nth iteration. 

4. The method set forth in any of claims 1, 2, or 3 wherein:

the step of optimizing the first set or the third set includes the step of computing a description length for the
first set or the third set and using the description length in the optimization.

5. The method set forth in claim 4 wherein: 

the step of optimizing the first set or the third set includes the step of pruning the first set of rules with regard 
to the entire first set. 

6. The method set forth in claim 5 and further comprising: 

the step of pruning each rule as the rule is induced to maximize the function $\frac{p-n}{p+n}$, where p is the number of
positive examples for the rule in the set of examples and n is the number of negative examples for the rule.

7. The method set forth in claim 5 or claim 6 wherein: 

the step of pruning the first set of rules is done by deleting rules from the first set such that the description 
length of the first set is reduced. 

8. The method set forth in any of claims 1, 2, or 3 wherein the step of optimizing the first set or the step of optimizing
the third set comprises the steps performed for each rule in the first or third set of:

making a modification of the rule and pruning the modification to minimize the error of the entire rule set; and
determining from the description length of the rule and the modification whether to replace the rule with the 
modification. 

9. The method set forth in claim 8 wherein: 

the step of making a modification comprises the steps of 

making a first modification independently of the rule and 

making a second modification by adding conditions to the rule; and 

the step of determining determines whether to replace the rule with the first modification or the second modifi- 
cation. 

10. The method set forth in claim 8 wherein the optimization further comprises the step of: 

pruning the first set of rules by deleting rules from the set such that the description length of the first set is 
reduced. 

11. The method set forth in any of claims 1, 2, or 3 wherein:

the step of inducing the first set of rules is performed by inducing the rules rule-by-rule until a predetermined 
stopping condition occurs. 

12. The method set forth in claim 11 wherein:

the step of inducing the first set of rules includes the step of checking the description length of the first set 
of rules to determine whether the stopping condition has occurred. 

13. The method set forth in claim 12 wherein: 

the step of checking the description length of the first set of rules is performed repeatedly and includes the 
step of comparing the description length of the current rule set with the shortest description length thus far obtained 
to determine whether the stopping condition has occurred. 

14. The method set forth in claim 13 wherein: 

the step of comparing the description length determines that the stopping condition has occurred when the 
description length of the current rule is more than a predetermined value larger than the shortest description length. 

15. A method practiced in a computer system which includes a processor and a memory system of inducing a set of 
rules for classifying data items from an example dataset of the data items, the rules and the example dataset being 
stored in the memory system

and the method comprising the steps performed in the processor of: 
for each rule, 

inducing the rule on the example dataset; 
adding the rule to the set of rules; 

computing the description length of the set of rules with the added rule; and 
terminating the method if the description length satisfies a predetermined condition. 

16. The method of claim 15 wherein:

the predetermined condition is a description length which is a predetermined amount larger than the small- 
est previously-computed description length. 

17. The method of claim 15 further comprising the step performed for each rule of: 

pruning the rule to maximize the function $\frac{p-n}{p+n}$, where p is the number of positive examples for the rule in the
set of examples and n is the number of negative examples for the rule.

FIG. 4

FIG. 6

FIG. 7

ripper (data)                                        /* 701, 601 */
// data is a set of examples;
// returns a ruleset
{
    /* this is IREP* */                              /* 403 */
    hyp = empty_rule_set;                            /* 703 */
    hyp = add_rules (data, hyp);
    hyp = reduce_dlen (hyp, data);                   /* 705 */

    /* this is RIPPER being iterated k times */
    for i = 1 ... k {                                /* 707 */
        hyp = optimize_rules (data, hyp);            /* 704 */
        hyp = add_rules (data, hyp);
        hyp = reduce_dlen (hyp, data);
    }
    return hyp;
}

/* optimize the ruleset hyp, and possibly add some more rules
 * a special case is when hyp is empty: then this builds a ruleset
 */
optimize_rules (data, hyp)                           /* 709 */
// data is a set of examples;
// hyp is the ruleset that was computed the last time around
{
    /* optimize existing rules */
    for rule_num = 1 ... (number of rules in hyp) {  /* 712 */
        /* split the data into growing and pruning sets */
        partition (data, grow_data, prune_data);

        /* save the old rule */
        old_rule = hyp[rule_num];                    /* 710 */

        /* build a new rule */                       /* 713 */
        new_rule = new rule with empty body asserting class to be '+';
        new_rule = refine (new_rule, grow_data);
        new_rule = simplify (new_rule, hyp, rule_num, prune_data);

        /* build a revised rule */                   /* 715 */
        revised_rule = hyp[rule_num];
        revised_rule = refine (revised_rule, grow_data);
        revised_rule = simplify (revised_rule, hyp, rule_num, prune_data);

        /* pick one of the old, new or revised rules */   /* 717 */
        new_val = relative_compression (new_rule, hyp, rule_num, data);
        rev_val = relative_compression (revised_rule, hyp, rule_num, data);
        old_val = relative_compression (old_rule, hyp, rule_num, data);
        if (old_val >= new_val and old_val >= rev_val) {
            chosen_rule = old_rule;
        } else if (rev_val >= new_val) {
            chosen_rule = revised_rule;
        } else {
            chosen_rule = new_rule;
        }
        remove examples covered by chosen_rule from data;   /* 721 */
        hyp[rule_num] = chosen_rule;
    }
}

FIG. 8

add_rules (data, hyp)                                /* 801 */
// data is a set of examples;
// hyp is a ruleset;
{
    remove examples covered by any rules in hyp from data;   /* 803 */

    /* add new rules for uncovered examples */
    while ((there are positive examples in data)
           and last_rule_accepted) {                 /* 804 */
        /* split the data into growing and pruning sets */
        partition (data, grow_data, prune_data);     /* 805 */

        /* build a new rule */
        new_rule = new rule with empty body asserting class to be '+';
        new_rule = refine (new_rule, grow_data);     /* 807 */
        new_rule = simplify (new_rule, hyp, rule_num, prune_data);   /* 809 */

        /* decide if you should keep the new rule */
        if (reject_rule (new_rule, data)) {          /* 811 */
            last_rule_accepted = FALSE;
        } else {
            remove examples covered by new_rule from data;   /* 813 */
            append the new_rule to hyp;              /* 815 */
        }
    }
    return hyp;
}

FIG. 9

/* decide if a new rule should be added to the ruleset */
reject_rule (rule, data, hyp, rule_num)              /* 901 */
{
    if (total compression of hyp with rule at posn rule_num >=
        best compression seen to date + MAX_DECOMPRESSION) {   /* 911 */
        return TRUE;
    }
    else if (error rate of rule on data > 50%) {     /* 913 */
        return TRUE;
    }
    else {
        return FALSE;
    }
}

/* grow the rule */
refine (rule, data)                                  /* 903 */
{
    last_refinement_rejected = FALSE;
    while (negative examples in data are covered by rule) {
        refinement = refinement ref of rule with max ref_value (rule, ref, data);
        if (reject_refinement (refinement, rule, data)) {
            last_refinement_rejected = TRUE;
        } else {
            rule = refinement;
            remove from data examples not covered by refinement;
        }
    }
    return rule;
}

/* value function used in refining a rule */
ref_value (old_rule, refined_rule, data)
{
    p1 = number of positive examples in data covered by old_rule;
    n1 = number of negative examples in data covered by old_rule;
    p2 = number of positive examples in data covered by refined_rule;
    n2 = number of negative examples in data covered by refined_rule;
    /* return 'information gain' */
    return p2 * (log2 ((p1+n1)/p1) - log2 ((p2+n2)/p2));   /* 907 */
}

/* generalize a rule */
simplify (rule, hyp, rule_num, data)                 /* 909 */
{
    while (body of rule is not empty) {              /* 910 */
        gen = generalization of rule with best gen_value (rule, hyp, rule_num, data);
        if (value of gen <= value of rule) {
            break;
        } else {
            rule = gen;
        }
    }
    return rule;
}

FIG. 10

/* value function used in generalization */
gen_value (rule, hyp, rule_num, data)                /* 1001 */
{
    if (rule_num < #rules in hyp) {
        /* optimizing a rule: use accuracy of rule in context */
        hyp1 = copy of hyp;
        hyp1[rule_num] = rule;
        e = number_of_errors made by hyp1 on data;
        tot = number of examples in data;
        return 1 - e/tot;
    } else {                                         /* 1005 */
        /* use heuristic function from paper */
        p = number of positive examples in data covered by rule;
        n = number of negative examples in data covered by rule;
        return (p-n)/(p+n);                          /* 1007 */
    }
}

/* reduction in description length obtained by inserting rule
   in hyp at position rule_num, relative to deleting that rule
   from the hypothesis
*/
relative_compression (rule, hyp, rule_num, data)     /* 1009 */
{
    null_rule = new rule with body 'false' asserting class to be '+';

    hyp_with = copy of hyp;                          /* 1011 */
    hyp_with[rule_num] = rule;
    hyp_with = reduce_dlen (hyp_with, data);

    hyp_without = copy of hyp;                       /* 1013 */
    hyp_without[rule_num] = null_rule;
    hyp_without = reduce_dlen (hyp_without, data);

    dlen_with = data_dlen (hyp_with, data);          /* 1015 */
    dlen_without = data_dlen (hyp_without, data);

    return dlen_without - (dlen_with + rule_dlen (rule));   /* 1017 */
}

FIG. 11

/* description length of data given a hypothesis, i.e. number of bits
 * needed to encode the exceptions to the predictions made by hyp */
data_dlen (hyp, data)                                /* 1101 */
{
    apply rules in hyp to data and compute these statistics:
        fp = #false positives;
        fn = #false negatives;
        cov = #examples covered;
        uncov = #examples not covered;

    /* return (bits to encode the exceptions (fp, fn))
       using the method of (Quinlan, ML95)
    */
    define subset_dlen(n,e,p) = -Log2(p)*e + -Log2(1-p)*(n-e)
    e = fn+fp;
    if (cov >= uncov) {
        return
            Log2 (cov+uncov+1)
            + subset_dlen (cov, fp, 0.5*e/cov)
            + subset_dlen (uncov, fn, fn/uncov);
    } else {
        return
            Log2 (cov+uncov+1)
            + subset_dlen (uncov, fn, 0.5*e/uncov)
            + subset_dlen (cov, fp, fn/uncov);
    }
}

/* reduce description length of hypothesis by deleting bad rules */
reduce_dlen (hyp, data)                              /* 1109 */
{
    n = #rules in hyp;
    for i = n ... 1 do                               /* 1111 */
        hyp1 = copy of hyp with rule i deleted;
        if (total_dlen (hyp1, data) < total_dlen (hyp, data)) {   /* 1113 */
            hyp = hyp1;
        }
    endfor
    return hyp;
}

/* total description length of hypothesis and data */
total_dlen (hyp, data)                               /* 1115 */
{
    n = #rules in hyp;
    tot = data_dlen (hyp, data);                     /* 1117 */
    for i = n ... 1 do tot += rule_dlen (hyp[i]);    /* 1119 */
    return tot;
}